The dataset came from Kaggle but was still a little messy; I'm not sure why the artists column came wrapped in ['name']. Some of the date formatting was also inconsistent (as always), so I imputed -01-01 whenever there was no date after the year.
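The date fix isn't shown in the code here, so below is a minimal sketch of what it could look like in base R. The helper name `impute_release_date` is hypothetical, and padding year-month values with `-01` is an extra assumption on top of the year-only case described above.

```r
# Hypothetical helper: pad partial dates so every value parses as a full date.
impute_release_date <- function(x) {
  x <- as.character(x)
  year_only  <- grepl("^[0-9]{4}$", x)           # e.g. "1999"
  year_month <- grepl("^[0-9]{4}-[0-9]{2}$", x)  # e.g. "1999-07" (assumed to occur)
  x[year_only]  <- paste0(x[year_only], "-01-01")
  x[year_month] <- paste0(x[year_month], "-01")
  as.Date(x, format = "%Y-%m-%d")
}

# Toy input, not rows from the actual dataset
impute_release_date(c("1999", "1999-07", "2001-05-17"))
```

Anything that still fails to parse comes back as `NA`, which makes leftover odd formats easy to spot.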
library(dplyr)    # %>%, mutate()
library(stringr)  # str_remove_all(), str_replace_all()

df2 <- df %>%
  mutate(artists = str_remove_all(artists, "^\\['|'\\]$"),    # drop the leading [' and trailing ']
         artists = str_replace_all(artists, "', '", " and "),   # join multiple artists
         # work on this a bit more
         decade = as.factor(floor(year / 10) * 10),
         year = as.factor(year))
# It would also be cool to try to recognize gender by name and make a dummy column

Look at the percentages and cumulatives: it looks like no popularity is very common!
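For reference, the percentage and cumulative columns in a frequency table like this can be reproduced by hand. This is a base R sketch of the idea, not the function used for the plot, and `freq_table` is a hypothetical name.

```r
# Hypothetical helper: counts, percentages and cumulative percentages,
# sorted from the most to the least common value.
freq_table <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  pct <- as.numeric(counts) / sum(counts) * 100
  data.frame(
    value   = names(counts),
    n       = as.integer(counts),
    pct     = round(pct, 2),
    cum_pct = round(cumsum(pct), 2)
  )
}

# Toy popularity scores, not the real data
freq_table(c(0, 0, 0, 10, 10, 20))
```

The first row is the most common value, and `cum_pct` shows how much of the data the top few values cover.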
There is about the same amount of music for every year in this dataframe.
Basically, newer stuff is more popular, with little to no exception.
Looks like explicitness has a roughly normal distribution across popularity levels.
library(lares)  # freqs(), distr(), corr_cross(), corr_var()

df %>%
  mutate(explicit = as.factor(explicit),
         popularity = round(popularity / 10)) %>%  # collapse popularity into rough tens
  freqs(popularity, explicit, plot = TRUE,
        title = "Popularity by Explicitness",
        subtitle = paste("Duncan Gates", Sys.Date()),
        results = FALSE)

Now we check out the distribution; there’s some really cool stuff here.
There are also some really long songs out there…
This looks more like the actual distribution.
Very interesting distribution here.
Looks like things are a lot more explicit in 2000-2020, as one might expect. It would be interesting to see when this starts, or what drives it. I also wonder what happened in 1920-1940?
df %>%
  mutate(explicit = as.factor(explicit),
         new_era = ifelse(year %in% 2000:2020, 1, 0)) %>%  # 1 for songs from 2000-2020
  distr(explicit, new_era)

By the way, mode is just whether the song is major or minor.
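Coming back to the question of when the rise in explicit songs starts: one way to look into it would be the share of explicit tracks per decade. This is a base R sketch only; the helper name `explicit_share_by_decade` is hypothetical, and it assumes `explicit` is coded as 0/1.

```r
# Hypothetical helper: share of explicit tracks in each decade.
# Assumes `explicit` is numeric 0/1, so the mean is the explicit share.
explicit_share_by_decade <- function(year, explicit) {
  decade <- floor(year / 10) * 10
  share <- tapply(explicit, decade, mean)
  data.frame(
    decade         = as.integer(names(share)),
    explicit_share = as.numeric(share)
  )
}

# Toy values, not the real data
explicit_share_by_decade(
  year     = c(1995, 1999, 2005, 2006, 2007, 2015),
  explicit = c(0, 0, 1, 0, 1, 1)
)
```

The first decade where the share jumps would be a natural place to start digging into what drives it.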
You can even use ggplot2!
Wouldn’t be data science without some random correlations, and it's even more data science/machine learningy since the second one also includes log-transformed versions of the variables!
df %>%
  select(-c(id, name, artists, year, release_date, key)) %>%
  corr_cross(top = 10)  # Look at the top 10 correlations in the data; key messes with this, idk why

table <- df %>%
  select(-c(id, name, artists, year, release_date, key)) %>%
  corr_var(popularity, logs = TRUE, plot = FALSE, top = 10)

table %>%
  mutate(corr = kableExtra::cell_spec(corr, "html", color = ifelse(corr > 0, "blue", "red"))) %>%
  kableExtra::kable(format = "html", escape = FALSE) %>%
  kableExtra::kable_styling("striped", full_width = FALSE, position = "center")

| variables | corr | pvalue |
|---|---|---|
| popularity_log | 0.890732 | 0 |
| acousticness | -0.573162 | 0 |
| acousticness_log | -0.55757 | 0 |
| energy_log | 0.488822 | 0 |
| energy | 0.485005 | 0 |
| loudness | 0.457051 | 0 |
| instrumentalness_log | -0.300402 | 0 |
| instrumentalness | -0.29675 | 0 |
| danceability | 0.199606 | 0 |
| danceability_log | 0.196287 | 0 |
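
The `_log` rows are correlations between popularity and log-transformed copies of each variable, which is why `popularity_log` sits at the top: it is just the target itself, transformed. A rough base R sketch of that comparison follows; the `log1p` transform and the helper name `corr_raw_vs_log` are assumptions, and the package may transform values differently.

```r
# Hypothetical helper: correlation of y with x and with a log-transformed x.
# log1p(x) = log(1 + x) is an assumed transform so that zeros are allowed.
corr_raw_vs_log <- function(x, y) {
  c(raw = cor(x, y), log = cor(log1p(x), y))
}

# Toy example: y is exactly log(1 + x), so the log version fits perfectly
x <- c(0, 1, 3, 7, 15)
y <- log1p(x)
corr_raw_vs_log(x, y)
```

When the raw and log correlations are close, as they are for most variables in the table, the transform is not adding much.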